README NGS project.

Introduction:

Argonauts are a proteins family involved in gene silencing, they recruit miRNAs which are small RNAs used as guides to find the target RNA and destroy it.

C.elegans has 25 different Argonaute proteines, these interact with various miRNAs.

Brown et al. 2017 generated mutants of several Argonaute proteins and sequenced their transcriptomes to compare them to the wild type.

Nevertheless, C. elegans being a fast developing organism, individuals from a sample can have various ages. Therefore there is an impact of the development stage in the variation observed between strains. A cutting edge tool will be used to assess the degree of impact of development on the results.

Software:

OS

Ubuntu 20.04 LTS.

Tools:

  • sratoolkit.2.10.0
  • FastQC v0.11.8
  • MultiQC 1.9
  • Trimmomatic O.39
  • Salmon 0.14.1
  • RAPToR 1.1.4

RNA-seq data download:

Link to the data (Geo database, accession code: GSE98935) https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE98935

Strains (Subset of the article’s data): WT, alg-2(ok304) II (Argonaute 2 mutant) and alg-5(ram2) I (Argonaute 5 mutant). There are 3 replicates of each strain.

Data was downloaded using the Download_data.sh script. Using the fastq-dump command.

Fasta def lines were changed (defline-seq and defline-qual):

Control the reads quality:

1) Fastqc analysis:

We used Fastqc.sh (one report per sample). Principal command:

fastqc $file -o . -t 6

2) Multiqc analysis:

We used Multiqc.sh (on Fastqc directory output, one report for all samples). Principal command:

multiqc $data/* -o . 

3) Results:

Adapters sequences are present in our reads, that’s why in the next step we will remove them.

multiQC report on raw RNA-seq data

Remove adapters from reads:

1) Find adapters sequences:

Adapters’ sequences were found in the supplementary data of Brown et al, 2017.

We generated reverse complement sequences using the online tool: http://reverse-complement.com/

Sequences were written on a Fasta file: Adapter_sequences.fa

2) Use Trimmomatic to remove adapters:

We used Trimmomatic.sh. Principal command:

java -jar $TRIMMOMATIC_JAR PE -phred33 $input_1 $input_2 $output_1_paired \
  $output_1_unpaired $output_2_paired $output_2_unpaired ILLUMINACLIP:$adapter:2:30:10\
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 

ILLUMINACLIP remove adapters, LEADING remove leading low quality bases, TRAILING remove trailing low quality bases, SLIDINGWINDOW scan the read with a 4-base sliding window and cut when the average quality per base drops below 15.

Trimmomatic generate two types of outputs, paired sequences (R1+R2) and unpaired sequences (R1 or R2 alone). For the rest of the analysis we will only keep paired sequences.

3) Control the reads quality after trimming:

I combined the two scripts of Fastqc.sh and Multiqc.sh to form the Control_qual_trim.sh. We see on the Multiqc report that there are no (or almost no) more adapters , and that the read lengths are not as homogeneous as before.

multiQC report after Trimmomatic

Transcript expression quantification:

1) Get the reference transcriptome:

The reference transcriptome of C.elegans was downloaded: https://www.ensembl.org/info/data/ftp/index.html

We used the wget command with this link: ftp://ftp.ensembl.org/pub/release-101/fasta/caenorhabditis_elegans/cdna/Caenorhabditis_elegans.WBcel235.cdna.all.fa.gz

Create an index using the Transcriptome_index.sh script

2) Run Salmon analysis:

We used Salmon to quantify the expression of transcripts in each sample:

Run salmon on experimental data using Salmon.sh. Principal command:

salmon quant -i $data/index_transcriptome -l A \
         -1 $input_1 \
         -2 $input_2 \
         -p 8 --validateMappings -o $quants/${name}_quant

-l is for the library type. -p is the number of threads that will be used for quasi mapping, quantification and bootstrapping. –validateMapping use a more sensitive and accurate mapping algorithm and run an extension alignment dynamic program on the potential mappings it produces.

3) Generate a Multiqc report with Salmon output data:

We used the Multiqc_after_salmon.sh script.

Ajouter captures du rapport multiqc

Import data:

We used tximport to import Salmon quantifications, it creates a matrix with abundance, counts and transcript length. After that, we put the matrix in a R object, so it’s possible to load the object for future use.

The script is: Tximport.R

tx2gene = wormRef::Cel_genes[, c("transcript_name", "wb_id")] #load transcript IDs and Gene IDs from wormRef
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene)

Differential expression analyses:

1) Run DESeq:

DESeq runs a statistical test on gene expression between WT and mutants. First we estimate the variance from the replicates of each condition. We use counts data with DESeq and adapted the matrix as necessary for DESeq.

DESeq_data <- DESeqDataSetFromMatrix(countData = salmon_matrix$counts, #Counts data from tximport
                              colData = salmon_matrix$pdata, #information about samples
                              design= ~ strain) #Conditions 

We set the pvalue to 0.05 (as done in Brown et al. 2017).

res_alg2 <- results(dds, name="strain_alg.2.ok304..II_vs_WT", alpha=0.05)

The WT dataset is the reference, if the Log2 FoldChange is >0 then the genes are upregulated in the mutant. If Log2 FoldChange is <0 then the genes are downregulated in the mutant.

Alg2 mutant vs WT

Alg5 mutant vs WT

We saved the differentially expressed gene names in csv files.

2) Gene ontology enrichment:

We used wormbase enrichment tool to assess the enrichment of mutants upregulated and downregulated genes. https://wormbase.org/tools/enrichment/tea/tea.cgi

1- alg2 mutant:

Upregulated genes: (The results differ from the article’s, they used different tools, it’s not surprising to see a variability)

Alg2 Upregulated genes Alg2 Upregulated genes

Downregulated genes: (Fit the article’s results)

Alg2 Downregulated genes Alg2 Downregulated genes

2- alg5 mutant:

Upregulated genes: (No upregulated genes in the article)

Alg5 Upregulated genes Alg5 Upregulated genes

Downregulated genes: (Fit the article’s results)

Alg5 Downregulated genes Alg5 Downregulated genes

Evaluate the impact of development:

1) Install RAPTOR and wormRef on R:

BiocManager::install("limma")
devtools::install_github("LBMC/RAPToR", build_vignettes = TRUE) devtools::install_github("LBMC/wormRef")

2) Load wormref:

We choose: cell larval to Young Adult as the right wormRef dataset because our samples are at L4 stage and this dataset covers it. We used 500 as the resolution of the interpolated reference.

ref_larv <- prepare_refdata("Cel_larv_YA", "wormRef", n.inter = 500) 

3) Run RAPToR:

We used RAPToR with abundance data, transcripts per million (TPM).

ae_data <- ae(samp = salmon_matrix$tpm,                         #input gene expression matrix
                          refdata = ref_larv$interpGE,            #interpolated gene expression
                          ref.time_series = ref_larv$time.series) #Reference time series  

Estimated age of each condition

Quantify the impact of development for each strain:

We used two functions for that: - One to get the indices/GExpr of the reference matching sample age estimates.

getrefTP <- function(ref, ae_obj, ret.idx = T){
  idx <- sapply(ae_obj$age.estimates[,1], function(t) which.min(abs(ref$time.series - t)))
  if(ret.idx)
    return(idx) 
  return(ref$interpGE[,idx])}
  • And then one to compute the reference changes and the observed changes. (log1p samples input)
refCompare <- function(samp, ref, ae_obj, fac){
  ovl <- format_to_ref(samp, getrefTP(ref_larv, ae_data, ret.idx = F))
  lm_samp <- lm(t(ovl$samp)~fac)
  lm_ref <- lm(t(ovl$ref)~fac) #Operates a statistical test, but we only keep the log2 fold change.
  return(list(samp=lm_samp, ref=lm_ref, ovl_genelist=ovl$inter.genes)) }

log1p_samp <- log1p(salmon_matrix$tpm)

rc <- refCompare(log1p_samp, ref_larv, ae_data, strain_groups)

From the output we measured correlation coefficients:

  • Alg2 vs Ref for all genes: 0.6105381

  • Alg2 vs Ref for DE genes: 0.7321606

  • Alg5 vs Ref for all genes: 0.09306767

  • Alg5 vs Ref for DE genes: 0.1953245

And we plotted the log1p Fold Change.

Alg2 vs Ref log1p Fold Change

Alg5 vs Ref log1p Fold Change